Me: Chris Hausler

Today - Pandas and Scikit-Learn

And a lot of firsts

  • first MPUG meeting... Hi
  • first presentation using IPython Notebook

Python Data Analysis Library (pandas)

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
It creates something similar to R's data frames... but better

I think it's great, but I'm still a bit clumsy with it... also the doco is still a little hit and miss

Some imports


In [1]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)

In [2]:
import numpy as np
import pandas as pd
import pylab as plt
import matplotlib 
%matplotlib inline
pd.__version__


Out[2]:
'0.13.1'

pandas has two main data structures: Series and DataFrame

Series - Like a one-dimensional array, but better


In [3]:
values = [5,3,4,8,2,9]
vals = pd.Series(values)
vals


Out[3]:
0    5
1    3
2    4
3    8
4    2
5    9
dtype: int64

Each value is now associated with an index. The index itself is an object of class Index and can be manipulated directly.


In [4]:
vals.index


Out[4]:
Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')

In [5]:
vals.values


Out[5]:
array([5, 3, 4, 8, 2, 9])

In [6]:
vals * 2.5


Out[6]:
0    12.5
1     7.5
2    10.0
3    20.0
4     5.0
5    22.5
dtype: float64

We can give named indexes


In [7]:
vals2 = pd.Series(values, index=['tom','sally','jeff','george','pablo','florence'])
vals2


Out[7]:
tom         5
sally       3
jeff        4
george      8
pablo       2
florence    9
dtype: int64

And use these to get the data we want


In [8]:
vals2[['florence','tom']]


Out[8]:
florence    9
tom         5
dtype: int64

In [9]:
vals2[['florence','tom','kate']]


Out[9]:
florence     9
tom          5
kate       NaN
dtype: float64

Dealing with missing values


In [10]:
vals3 = vals2[['tom','sally','pablo','florence','ricky','katrin']]
vals3


Out[10]:
tom          5
sally        3
pablo        2
florence     9
ricky      NaN
katrin     NaN
dtype: float64

Get rid of them


In [11]:
vals3.dropna()


Out[11]:
tom         5
sally       3
pablo       2
florence    9
dtype: float64

Fill them with a value


In [12]:
vals3.fillna(0)


Out[12]:
tom         5
sally       3
pablo       2
florence    9
ricky       0
katrin      0
dtype: float64

Fill them with a calculated value


In [13]:
vals3.fillna(vals3.mean())


Out[13]:
tom         5.00
sally       3.00
pablo       2.00
florence    9.00
ricky       4.75
katrin      4.75
dtype: float64

Use a function like forward fill


In [14]:
vals3.fillna(method='ffill')


Out[14]:
tom         5
sally       3
pablo       2
florence    9
ricky       9
katrin      9
dtype: float64

A handy way to get a picture of our data


In [15]:
vals3.describe()


Out[15]:
count    4.000000
mean     4.750000
std      3.095696
min      2.000000
25%      2.750000
50%      4.000000
75%      6.000000
max      9.000000
dtype: float64

DataFrame - Like a 2D array... with bells and whistles


In [16]:
# give vals named indexes too, and re-select vals3 so the two Series only partly overlap
vals.index = pd.Index(['tom','sally','pablo','florence','ricky','katrin'])
vals3 = vals3[['tom','sally','pablo','florence','billy','katrin']]

In [17]:
# create a dataframe
dat = pd.DataFrame({'orig':vals,'new':vals3})
dat


Out[17]:
          new  orig
billy     NaN   NaN
florence    9     8
katrin    NaN     9
pablo       2     4
ricky     NaN     2
sally       3     3
tom         5     5

7 rows × 2 columns

Check for nulls


In [18]:
dat.isnull()


Out[18]:
            new   orig
billy      True   True
florence  False  False
katrin     True  False
pablo     False  False
ricky      True  False
sally     False  False
tom       False  False

7 rows × 2 columns

Drop rows with nulls


In [19]:
dat.dropna()


Out[19]:
          new  orig
florence    9     8
pablo       2     4
sally       3     3
tom         5     5

4 rows × 2 columns

Time series with pandas DataFrames - a winning combination

Data from Google Trends... what correlates (+ve & -ve) with the search term "hipster"?

Read the hipster correlations from a CSV file

pandas supports many file formats for reading and writing (a few are sketched below), including

  • csv
  • json
  • pickle
  • the clipboard

In [20]:
hipster = pd.read_csv('hipster.csv')
hipster[:10]


Out[20]:
         Date  hipster  modcloth  gumtree perth
0  2004-01-04   -0.976    -0.817         -0.844
1  2004-01-11   -0.816    -0.817         -0.844
2  2004-01-18   -0.837    -0.817         -0.844
3  2004-01-25   -0.976    -0.817         -0.844
4  2004-02-01   -0.722    -0.817         -0.844
5  2004-02-08   -0.795    -0.817         -0.844
6  2004-02-15   -0.723    -0.817         -0.844
7  2004-02-22   -0.713    -0.817         -0.844
8  2004-02-29   -0.786    -0.817         -0.844
9  2004-03-07   -0.675    -0.817         -0.844

10 rows × 4 columns

Set the index to a datetime


In [21]:
hipster = hipster.set_index(pd.DatetimeIndex(hipster.pop('Date')))
hipster[:10]


Out[21]:
            hipster  modcloth  gumtree perth
2004-01-04   -0.976    -0.817         -0.844
2004-01-11   -0.816    -0.817         -0.844
2004-01-18   -0.837    -0.817         -0.844
2004-01-25   -0.976    -0.817         -0.844
2004-02-01   -0.722    -0.817         -0.844
2004-02-08   -0.795    -0.817         -0.844
2004-02-15   -0.723    -0.817         -0.844
2004-02-22   -0.713    -0.817         -0.844
2004-02-29   -0.786    -0.817         -0.844
2004-03-07   -0.675    -0.817         -0.844

10 rows × 3 columns

Now load the anti-Hipster data


In [22]:
not_hipster = pd.read_csv('negative-hipster.csv')
not_hipster = not_hipster.set_index(pd.DatetimeIndex(not_hipster.pop('Date')))

In [23]:
not_hipster[:10]


Out[23]:
            yellow pages  windows installer  techno
2004-01-04         1.341              0.668   0.871
2004-01-11         1.239              1.000   1.122
2004-01-18         1.022              0.768   1.053
2004-01-25         0.923              0.943   0.807
2004-02-01         0.904              0.799   0.612
2004-02-08         0.786              0.613   0.614
2004-02-15         0.729              0.956   0.391
2004-02-22         0.537              0.667   1.124
2004-02-29         0.534              1.415   1.078
2004-03-07         0.229              0.220   1.918

10 rows × 3 columns

Check the values of one column


In [24]:
hipster.hipster.head()


Out[24]:
2004-01-04   -0.976
2004-01-11   -0.816
2004-01-18   -0.837
2004-01-25   -0.976
2004-02-01   -0.722
Name: hipster, dtype: float64

Check another, but get the values as a numpy.ndarray


In [25]:
hipster['gumtree perth'].values[:20]


Out[25]:
array([-0.844, -0.844, -0.844, -0.844, -0.844, -0.844, -0.844, -0.844,
       -0.844, -0.844, -0.844, -0.844, -0.844, -0.844, -0.844, -0.844,
       -0.844, -0.844, -0.844, -0.844])

View the data types; they don't need to be homogeneous


In [26]:
hipster.dtypes


Out[26]:
hipster          float64
modcloth         float64
gumtree perth    float64
dtype: object

Joins on indexes are easy!


In [27]:
trend = hipster.join(not_hipster, how='inner')
trend.head()


Out[27]:
            hipster  modcloth  gumtree perth  yellow pages  windows installer  techno
2004-01-04   -0.976    -0.817         -0.844         1.341              0.668   0.871
2004-01-11   -0.816    -0.817         -0.844         1.239              1.000   1.122
2004-01-18   -0.837    -0.817         -0.844         1.022              0.768   1.053
2004-01-25   -0.976    -0.817         -0.844         0.923              0.943   0.807
2004-02-01   -0.722    -0.817         -0.844         0.904              0.799   0.612

5 rows × 6 columns

We can check the column names and values


In [28]:
trend.columns


Out[28]:
Index([u'hipster', u'modcloth', u'gumtree perth', u'yellow pages', u'windows installer', u'techno'], dtype='object')

In [29]:
trend.values


Out[29]:
array([[-0.976, -0.817, -0.844,  1.341,  0.668,  0.871],
       [-0.816, -0.817, -0.844,  1.239,  1.   ,  1.122],
       [-0.837, -0.817, -0.844,  1.022,  0.768,  1.053],
       ..., 
       [ 1.142,  1.175,  1.394, -1.69 , -1.77 , -1.836],
       [ 1.187,  1.221,  1.403, -1.706, -1.752, -1.796],
       [ 1.514,  1.216,  1.365, -1.72 , -1.701, -1.883]])

Filtering on date ranges is simple


In [30]:
trend['2012-01-01':].head()


Out[30]:
            hipster  modcloth  gumtree perth  yellow pages  windows installer  techno
2012-01-01    1.411     1.192          1.774        -1.077             -1.134  -1.285
2012-01-08    1.513     1.111          1.579        -0.995             -1.183  -1.189
2012-01-15    1.523     1.427          1.613        -1.027             -1.161  -1.337
2012-01-22    1.600     1.490          1.514        -1.140             -1.177  -1.345
2012-01-29    1.459     1.561          1.511        -1.046             -1.224  -1.233

5 rows × 6 columns


In [31]:
trend['2012-01-01': '2013-01-01'].tail(3)


Out[31]:
            hipster  modcloth  gumtree perth  yellow pages  windows installer  techno
2012-12-16    1.645     1.175          1.407        -1.433             -1.515  -1.687
2012-12-23    1.591     1.695          1.625        -1.698             -1.655  -1.504
2012-12-30    1.596     1.515          1.868        -1.515             -1.598  -1.674

3 rows × 6 columns

We can also grab a single date, or a subset of columns


In [32]:
trend.ix['2012-01-01', ['hipster', 'modcloth']]


Out[32]:
hipster     1.411
modcloth    1.192
Name: 2012-01-01 00:00:00, dtype: float64

Or do some boolean filtering


In [33]:
trend[trend.techno < 0].head()


Out[33]:
            hipster  modcloth  gumtree perth  yellow pages  windows installer  techno
2004-04-11   -0.510    -0.817         -0.844         0.521              0.301  -0.270
2006-01-29   -0.838    -0.817         -0.844         1.421              1.309  -0.081
2006-06-25   -0.799    -0.817         -0.833         1.142              1.458  -0.070
2010-01-24   -0.454    -0.183         -0.107        -0.017              0.053  -0.010
2010-01-31   -0.381    -0.276         -0.142         0.187              0.116  -0.044

5 rows × 6 columns

Plotting is built in, and handles dates more easily than raw matplotlib


In [34]:
_ = trend.plot(figsize=(10, 6))
_ = plt.legend(loc='best', ncol=2)


We can also do it for a single column (here, its cumulative sum)


In [35]:
_ = trend.hipster.cumsum().plot()


Or split the columns out to subplots


In [36]:
axs = trend.plot(subplots=True, figsize=(10, 10))


Resampling data is also straightforward.


In [37]:
# resample by month
trend.resample('M', how='mean').head()


Out[37]:
             hipster  modcloth  gumtree perth  yellow pages  windows installer   techno
2004-01-31  -0.90125    -0.817         -0.844       1.13125            0.84475  0.96325
2004-02-29  -0.74780    -0.817         -0.844       0.69800            0.89000  0.76380
2004-03-31  -0.78950    -0.817         -0.844       0.35650            0.73125  1.09175
2004-04-30  -0.70400    -0.817         -0.844       0.48125            0.89125  0.41950
2004-05-31  -0.81820    -0.817         -0.844       0.34780            0.62040  0.72860

5 rows × 6 columns

And here by year; one can also resample by business day, week, month, quarter, year and a bunch of other frequencies (a sketch of the aliases follows the plot)


In [38]:
# resample by year
_ = trend.resample('A', how='mean').plot(figsize=(10, 10))


Other fancy plots include a scatter matrix, with a kernel density estimate (KDE) on the diagonal


In [39]:
# look at the relations
_ = pd.scatter_matrix(trend, figsize=(12,8), diagonal='kde')


Titanic: Machine Learning from Disaster (kaggle.com)

Load the data, explore it and learn from it


In [40]:
df = pd.read_csv('train.csv', header=0)

In [41]:
df.head()


Out[41]:
   PassengerId  Survived  Pclass                                               Name     Sex  Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male   22      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female   38      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female   26      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female   35      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male   35      0      0            373450   8.0500   NaN        S

5 rows × 12 columns

Let's look at the data types here (this time they're heterogeneous)


In [42]:
df.dtypes


Out[42]:
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

We can also get a more verbose summary


In [43]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)

DataFrames can be grouped, like in SQL (it sucked to be a young male on the Titanic)


In [44]:
df_grouped = df.groupby(['Pclass', 'Sex'])

In [45]:
df_grouped[['Age', 'Survived']].mean()


Out[45]:
                     Age  Survived
Pclass Sex
1      female  34.611765  0.968085
       male    41.281386  0.368852
2      female  28.722973  0.921053
       male    30.740707  0.157407
3      female  21.750000  0.500000
       male    26.507589  0.135447

6 rows × 2 columns

Histograms are straightforward


In [46]:
ax = df['Age'].dropna().hist(bins=20, range=(0,100), alpha = .5)
ax.set_xlabel('Age')
ax.set_ylabel('Passenger Count')


Out[46]:
<matplotlib.text.Text at 0x7576ed0>

So are boxplots


In [47]:
bp = df.boxplot(column='Age', by='Pclass', grid=False)
for i in set(df.Pclass):
    y = df.Age[df.Pclass==i].dropna()
    # Add some random "jitter" to the x-axis
    x = np.random.normal(i, 0.04, size=len(y))
    plt.plot(x, y, 'r.', alpha=0.2)


If we want to do some learning on this data... let's convert gender to a binary numeric feature


In [48]:
df['isFemale'] = df['Sex'].map( {'female': 1, 'male': 0} ).astype(int)
df[['Sex','isFemale']].head()


Out[48]:
      Sex  isFemale
0    male         0
1  female         1
2  female         1
3  female         1
4    male         0

5 rows × 2 columns

Find non-numeric columns so we can drop them later


In [49]:
drop_cols = df.columns[df.dtypes.map(lambda x: x=='object')]
drop_cols


Out[49]:
Index([u'Name', u'Sex', u'Ticket', u'Cabin', u'Embarked'], dtype='object')

In [50]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 13 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
isFemale       891 non-null int64
dtypes: float64(2), int64(6), object(5)

Set up our data to learn from


In [51]:
X = pd.DataFrame(df[[c for c in df.columns if c != 'Survived']])
X = X.drop(drop_cols, axis=1) 
X = X.drop('PassengerId', axis=1)
y = df.Survived
print X.head()


   Pclass  Age  SibSp  Parch     Fare  isFemale
0       3   22      1      0   7.2500         0
1       1   38      1      0  71.2833         1
2       3   26      0      0   7.9250         1
3       1   35      1      0  53.1000         1
4       3   35      0      0   8.0500         0

[5 rows x 6 columns]

Have a quick look at the class distribution


In [52]:
y.groupby(y.values).count()


Out[52]:
0    549
1    342
dtype: int64

And fill in the NaNs for Age


In [53]:
X['Age'] = X.Age.fillna(X.Age.median())

scikit-learn

Machine Learning in Python

  • Simple and efficient tools for data mining and data analysis
  • Accessible to everybody, and reusable in various contexts
  • Built on NumPy, SciPy, and matplotlib
  • Open source, commercially usable - BSD license

What is Machine Learning?

"Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data" - thanks, Wikipedia

Prediction with scikit-learn is easy - who will survive?


In [54]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score as acc

In [55]:
# create our classifier
clf = LogisticRegression()
# fit it to the data
clf.fit(X, y)
# and predict
preds = clf.predict(X)
res_acc = acc(y, preds)
print 'Accuracy Score: {:.2f}'.format(res_acc)
print 'Not too bad'


Accuracy Score: 0.80
Not too bad

Cross-validation gives a fairer performance estimate - scoring a model on the same data it was trained on flatters it


In [56]:
from sklearn.cross_validation import KFold

In [57]:
cv = KFold(n=len(y), n_folds=5, shuffle=True)
preds = np.zeros_like(y)
for train, test in cv:
    clf = LogisticRegression()
    clf.fit(X.ix[train], y.ix[train])
    preds[test] = clf.predict(X.ix[test])
res_acc = acc(y, preds)
print 'Accuracy Score: {:.2f}'.format(res_acc)


Accuracy Score: 0.79

And cross-validation can be done more easily


In [58]:
# scikits can actually take care of this for us
from sklearn.cross_validation import cross_val_score

# here
clf = LogisticRegression()
scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
# to here

print scores
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))


[ 0.81564246  0.76966292  0.75842697  0.83707865  0.78651685]
Accuracy: 0.79 (+/- 0.06)

Dealing with categorical data


In [59]:
df.Embarked.head()


Out[59]:
0    S
1    C
2    S
3    S
4    S
Name: Embarked, dtype: object

In [60]:
set(df.Embarked.fillna('O'))


Out[60]:
{'C', 'O', 'Q', 'S'}

Use the LabelEncoder


In [61]:
from sklearn import preprocessing
df.Embarked = df.Embarked.fillna('O')
le = preprocessing.LabelEncoder()
le.fit(df.Embarked.values)
le.classes_


Out[61]:
array(['C', 'O', 'Q', 'S'], dtype=object)

In [62]:
X['Embarked'] = le.transform(df.Embarked.values)
X.Embarked.head()


Out[62]:
0    3
1    0
2    3
3    3
4    3
Name: Embarked, dtype: int64

Tuning classifier parameters


In [63]:
for C in [0.001, 0.01, 0.1, 1, 10, 100]:
    clf = LogisticRegression(C=C, penalty='l1')
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    print("C: {:3.3f}\tAccuracy: {:.2f} (+/- {:.2f})"
          .format(C, scores.mean(), scores.std() * 2))


C: 0.001	Accuracy: 0.67 (+/- 0.03)
C: 0.010	Accuracy: 0.67 (+/- 0.03)
C: 0.100	Accuracy: 0.79 (+/- 0.05)
C: 1.000	Accuracy: 0.80 (+/- 0.06)
C: 10.000	Accuracy: 0.79 (+/- 0.04)
C: 100.000	Accuracy: 0.79 (+/- 0.04)

Comparing classifiers is easy


In [64]:
# normalise the data - scale matters for distance-based models like KNN and SVM
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(X)

In [65]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.lda import LDA
from sklearn.qda import QDA

names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Decision Tree",
         "Random Forest", "AdaBoost", "Naive Bayes", "LDA",
         "QDA", "Logistic Regression"]
classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GaussianNB(),
    LDA(),
    QDA(),
    LogisticRegression(class_weight='auto')]

In [66]:
# fit each classifier and find the mean performance
res = []
for name, clf in zip(names, classifiers):
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    res.append(scores.mean())

In [67]:
import prettyplotlib as ppl
res = np.array(res)
names = np.array(names)
idx = np.argsort(res)[::-1]
fig, ax = plt.subplots(1, figsize=(14, 6))
ppl.bar(ax, np.arange(len(res)), res[idx], annotate=True,
        xticklabels=names[idx], grid='y')
plt.xticks(rotation=30)
_ = ax.set_ylim(res.min() * 0.95, res.max() * 1.05)


Models can be pickled


In [69]:
# models can be saved
import pickle
s = pickle.dumps(clf)

And there is a whole lot more scikit-learn can do...

  • supervised learning
  • model evaluation
  • unsupervised learning (tiny sketch below)
  • feature selection
  • feature extraction

[machine learning map by Andreas Mueller]